
Machine api mvp #119

Merged · merged 3 commits into openshift:master on Sep 13, 2018

Conversation

@enxebre
Member

enxebre commented Aug 10, 2018

This PR provides a minimal functional integration of the installer with the machine API and machineSets.

It removes the Terraform step that created worker machines on AWS, so after bootstrapping a cluster with openshift install the worker machines are managed by a machineSet object. This gives us an early, e2e-test-driven workflow.
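
For illustration, a rough sketch of what working with the machineSet-managed workers could look like after install (the machineSet name and namespace below are placeholders, not something this PR pins down):

  # list the machineSet and the machines it reconciles
  $ kubectl get machinesets --all-namespaces
  $ kubectl get machines --all-namespaces

  # scaling workers becomes a machineSet edit instead of a Terraform change
  $ kubectl scale machineset <worker-machineset> -n <namespace> --replicas=3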

Follow ups:

  • Loads of fixes for the underlying tooling (machine-api-actuator, AWS actuator)
  • Discuss/consolidate operator config with machineConfig
  • Discuss how we deal with the machineSet object (is it owned by the machineConfig operator? By the machine-api-operator along with the machine-api components? By its own custom operator?)

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 10, 2018
@enxebre
Member Author

enxebre commented Aug 10, 2018

@trawler
Contributor

trawler commented Aug 13, 2018

/retest

Member

@ingvagabund ingvagabund left a comment


Still not completely familiar with the tf/ign parts of the code, but I've left some comments and questions.

@@ -50,6 +50,7 @@ module "bootkube" {
service_account_private_key_pem = "${local.service_account_private_key_pem}"

etcd_endpoints = "${data.template_file.etcd_hostname_list.*.rendered}"
worker_ign_config = "${file("worker.ign")}"

Member

Who/what creates the worker.ign file? What is inside the file?

Member

I see, most likely generateIgnConfigStep is responsible for that, right?

Contributor

correct :)
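
For anyone curious about the second question: worker.ign is a JSON Ignition config rendered by that step, so a quick peek (assuming jq is available) shows what it carries:

  # worker.ign is plain JSON; inspect the spec version and top-level sections
  $ jq '.ignition.version' worker.ign
  $ jq 'keys' worker.ign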

- name: machine-api-operator
image: quay.io/alberto_lamela/machine-api-operator:a661677 # TODO: move this to openshift org
command:
- "/machine-api-operator"

Member

Is the rule to put all the binaries under /? I noticed this pattern is used in multiple places (even in the cluster-api upstream repo). IMHO, whenever possible I prefer /usr/bin/ or /bin; it's more intuitive.

Contributor

The convention is that all Tectonic operator binaries are placed under the image's root folder.

@@ -17,6 +17,9 @@ variable "manifest_names" {
"tectonic-node-controller-operator.yaml",
"tnc-tls-secret.yaml",
"ign-config.yaml",
"app-version-mao.yaml",

Member

mao is not so far from Mao. What about renaming mao -> machine-api-operator? It's not that long even with the additional characters.

@trawler
Contributor

trawler commented Aug 13, 2018

/retest

2 similar comments
@trawler
Contributor

trawler commented Aug 14, 2018

/retest

@enxebre
Member Author

enxebre commented Aug 14, 2018

/retest

@enxebre
Member Author

enxebre commented Aug 14, 2018

Getting:

ip-10-0-144-110.ec2.internal   Ready     etcd      3m        v1.11.0+d4cacc0
ip-10-0-168-133.ec2.internal   Ready     etcd      3m        v1.11.0+d4cacc0
ip-10-0-81-142.ec2.internal    Ready     master    3m        v1.11.0+d4cacc0
Waiting for router to be created ...
NAME                           STATUS    ROLES     AGE       VERSION
ip-10-0-139-152.ec2.internal   Ready     etcd      3m        v1.11.0+d4cacc0
ip-10-0-144-110.ec2.internal   Ready     etcd      3m        v1.11.0+d4cacc0
ip-10-0-168-133.ec2.internal   Ready     etcd      3m        v1.11.0+d4cacc0
ip-10-0-81-142.ec2.internal    Ready     master    3m        v1.11.0+d4cacc0
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
router    1         1         1            0           10s
error: timed out waiting for the condition
2018/08/14 13:15:57 Container setup in pod e2e-aws failed, exit code 1, reason Error
Another process exited
2018/08/14 13:16:02 Container test in pod e2e-aws failed, exit code 1, reason Error

@yifan-gu @smarterclayton would you be able to help get this green? Is there any way we can look at the kube logs to troubleshoot?

@smarterclayton
Contributor

smarterclayton commented Aug 14, 2018 via email

@smarterclayton
Contributor

smarterclayton commented Aug 14, 2018 via email

@smarterclayton
Contributor

/retest

@enxebre
Member Author

enxebre commented Aug 14, 2018

@smarterclayton thanks for the help! We need to rerun the tests to pick up @trawler's fix openshift/machine-api-operator@d84ba81.

This PR runs only the workers as a machineSet. The machine-api-operator and the cluster-api stack run on the masters, so when the workers do not come up it can be debugged by getting the logs of the relevant pods.
To do the same in CI we'd need some guidance on how to locate the cluster.
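
A minimal sketch of that kind of debugging, assuming access to the cluster's kubeconfig (namespace and pod names are placeholders):

  # find the machine-api / cluster-api pods running on the masters
  $ kubectl get pods --all-namespaces -o wide | grep -i machine

  # then pull their logs when the workers fail to come up
  $ kubectl logs -n <namespace> <machine-api-operator-pod>
  $ kubectl logs -n <namespace> <aws-actuator-pod>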

@smarterclayton
Contributor

The teardown scripts use the Kubernetes API to find and retrieve logs - we grab all pods, all node logs, all events, and a number of other things, without looking at AWS info.
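
Not the actual teardown script, but the kind of gathering described would look roughly like this (all via the Kubernetes API, no AWS calls):

  $ kubectl get pods --all-namespaces -o wide > pods.txt
  $ kubectl get events --all-namespaces > events.txt
  $ kubectl describe nodes > nodes.txt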

@smarterclayton
Contributor

/test e2e-aws

Failed due to something about reuse

@@ -8,6 +8,7 @@ data:
clusterName: ${cluster_name}
clusterDomain: ${cluster_domain}
region: ${region}
image: ${image}

Contributor

@enxebre Is there a path to upgrade the image when the cluster updates?

@enxebre
Member Author

enxebre commented Aug 14, 2018

Looking better now. The Jenkins test actually passed; however, it fails to destroy because the workers are no longer under tf control.

@enxebre
Member Author

enxebre commented Aug 14, 2018

/test build-tarball

@enxebre
Member Author

enxebre commented Aug 14, 2018

All tests are passing now. The Jenkins ones show red because Terraform fails on destroy, since it does not know how to handle the machines created by the machineSet. We could add kubectl here https://github.com/openshift/installer/blob/master/images/tectonic-smoke-test-env/Dockerfile and run kubectl scale machineSet replicas=0 here https://github.com/openshift/installer/blob/master/tests/run.sh#L20 cc @paulfantom @trawler @ingvagabund @yifan-gu
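
Something along these lines in run.sh, before the terraform destroy step (the machineSet name and namespace are placeholders and would need the real values):

  # scale the worker machineSet down so its machines are deleted
  # before terraform destroy runs
  $ kubectl scale machineset <worker-machineset> -n <namespace> --replicas=0

  # optionally confirm the machines are gone before destroying
  $ kubectl get machines -n <namespace>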

@paulfantom
Contributor

I wonder why the ci/prow/e2e-aws tests are green while the Jenkins ones are failing. Is destroying the cluster tested only in Jenkins?

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 15, 2018
@trawler
Contributor

trawler commented Aug 15, 2018

Jenkins fails with account limit errors:

* aws_elb.api_external: TooManyLoadBalancers: Exceeded quota of account 846518947292
status code: 400, request id: adf75e45-a05e-11e8-b6ea-1b1a348c0d2c
* module.vpc.aws_nat_gateway.nat_gw[0]: 1 error(s) occurred:
 
* aws_nat_gateway.nat_gw.0: Error creating NAT Gateway: NatGatewayLimitExceeded: Performing this operation would exceed the limit of 5 NAT gateways
status code: 400, request id: 1a23a804-f99c-4392-9e94-083bffcadaec
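
For reference, a quick way to check current usage against those limits (assuming the AWS CLI is configured for the CI account):

  # count classic ELBs against the load balancer quota
  $ aws elb describe-load-balancers --query 'length(LoadBalancerDescriptions)'

  # count live NAT gateways against the limit of 5 mentioned in the error
  $ aws ec2 describe-nat-gateways --filter Name=state,Values=available --query 'length(NatGateways)'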

@ingvagabund
Member

retest this please

1 similar comment
@paulfantom
Contributor

retest this please

@trawler
Contributor

trawler commented Aug 15, 2018

@enxebre I found that this PR (as it is now) breaks the behavior of the installer's destroy function... If Terraform fails (for any reason) to complete the installation, it is then impossible to run installer destroy, since the destroy workflow now looks for the cluster API, which never came up.
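
One possible mitigation, purely as a sketch (placeholders throughout, and not what the installer currently does): guard the machine cleanup on the cluster API being reachable, and fall back to the plain Terraform destroy otherwise.

  # only attempt machineSet cleanup if the cluster API ever became reachable
  if kubectl cluster-info >/dev/null 2>&1; then
    kubectl scale machineset <worker-machineset> -n <namespace> --replicas=0
  else
    echo "cluster API never came up; skipping machineSet cleanup"
  fi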

@trawler
Contributor

trawler commented Aug 15, 2018

/test unit

@trawler
Contributor

trawler commented Aug 15, 2018

/test e2e-aws

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 13, 2018
@abhinavdahiya
Contributor

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 13, 2018
@openshift-merge-robot openshift-merge-robot merged commit b6a182a into openshift:master Sep 13, 2018
@bison bison deleted the machine-api-mvp branch September 13, 2018 21:54
@@ -160,3 +160,9 @@ variable "pull_secret" {
type = "string"
description = "Your pull secret. Obtain this from your Tectonic Account: https://account.coreos.com."
}

variable "worker_ign_config" {
description = "Worker ignition config"

Member

Is this temporary? I'd have expected the machine API operator to manage worker ignition configs on its own.

Contributor

Not temporary to my knowledge. I'm not sure why the machine-api-operator would manage the contents of the ignition configs. The configs are just an input to be passed along to the MachineSets. If anything, I would expect the machine-config-operator (this is getting confusing...) to manage the ignition configs and provide them in some way to the machine-api-operator.

Member

If anything, I would expect the machine-config-operator (this is getting confusing...) to manage the ignition configs and provide them in some way to the machine-api-operator.

That makes sense to me. @abhinavdahiya?

wking added a commit to wking/openshift-installer that referenced this pull request Sep 20, 2018
… cluster-api

Generated with:

  $ glide update --strip-vendor
  $ glide-vc --use-lock-file --no-tests --only-code
  $ bazel run //:gazelle

using:

  $ glide --version
  (cd $GOPATH/src/github.com/Masterminds/glide && git describe)
  v0.13.1-7-g3e13fd1
  $ (cd $GOPATH/src/github.com/sgotti/glide-vc && git describe)
  v0.1.0-2-g6ddf6ee
  $ bazel version
  Build label: 0.16.1- (@non-git)
  Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
  Build time: Mon Aug 13 16:42:29 2018 (1534178549)
  Build timestamp: 1534178549
  Build timestamp as int: 1534178549

The tectonic-node-controller removal catches us up with 596591b (.*:
replace tectonic node controller with machine config operator,
2018-09-10, openshift#232).

The cluster-api trim adjusts the content from b00e40e (vendor: Add
client from sigs.k8s.io/cluster-api, 2018-09-04, openshift#119).  Because
cluster-api wasn't in glide.lock, I suspect neither glide nor glide-vc
were run before that commit.
wking added a commit to wking/openshift-installer that referenced this pull request Sep 20, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Sep 20, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Nov 27, 2018
The last consumers for these were removed by 124ac35 (*: Use
machine-api-operator to deploy worker nodes, 2018-09-04, openshift#119).
wking added a commit to wking/openshift-installer that referenced this pull request Nov 27, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Nov 27, 2018
And the related, libvirt-specific, tectonic_libvirt_worker_ips.  This
simplifies the Terraform logic for AWS and OpenStack, and consistently
pushes most worker setup into the cluster-API providers who are
creating the workers since 124ac35 (*: Use machine-api-operator to
deploy worker nodes, 2018-09-04, openshift#119).
wking added a commit to wking/openshift-installer that referenced this pull request Nov 27, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Nov 28, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Nov 29, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Nov 29, 2018
EmilienM pushed a commit to shiftstack/installer that referenced this pull request Dec 8, 2020
EmilienM pushed a commit to shiftstack/installer that referenced this pull request Dec 8, 2020
BUG 1877743: Bump K8s dependencies to v1.19.0
clnperez added a commit to clnperez/installer that referenced this pull request Mar 12, 2022
…improvements

Improvements for the PowerVS asset/installconfig package